
    Paradigm Completion for Derivational Morphology

    The generation of complex derived word forms has been an overlooked problem in NLP; we fill this gap by applying neural sequence-to-sequence models to the task. We review the theoretical motivation for a paradigmatic treatment of derivational morphology and introduce the task of derivational paradigm completion as a parallel to inflectional paradigm completion. State-of-the-art neural models, adapted from the inflection task, are able to learn a range of derivation patterns, and they outperform a non-neural baseline by 16.4%. However, due to the semantic, historical, and lexical considerations involved in derivational morphology, future work will be needed to achieve performance parity with inflection-generating systems.
    Comment: EMNLP 201
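To make the task concrete, here is a minimal sketch of derivational paradigm completion and of the kind of non-neural baseline the abstract alludes to. The slot labels, word pairs, and suffix table are invented simplifications for illustration, not the paper's actual tag set or data.

```python
# Derivational paradigm completion: map (base form, derivational slot)
# to a derived form. Slot labels here are hypothetical, not UniMorph tags.
paradigm_examples = [
    (("approve", "V->N;RESULT"), "approval"),
    (("happy", "ADJ->N;QUALITY"), "happiness"),
    (("modern", "ADJ->V;CAUSATIVE"), "modernize"),
]

def suffix_rule_baseline(base: str, slot: str) -> str:
    """A crude non-neural baseline: attach one fixed suffix per slot.
    It fails on stem changes (approve -> approv-al, happy -> happi-ness),
    which is where character-level seq2seq models can do better."""
    suffixes = {"V->N;RESULT": "al",
                "ADJ->N;QUALITY": "ness",
                "ADJ->V;CAUSATIVE": "ize"}
    return base + suffixes[slot]

for (base, slot), gold in paradigm_examples:
    pred = suffix_rule_baseline(base, slot)
    print(f"{base} + {slot}: predicted {pred!r}, gold {gold!r}")
```

Note that naive concatenation gets "modernize" right but produces "approveal" and "happyness", illustrating why the generation problem is non-trivial even for regular-looking patterns.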

    Marrying Universal Dependencies and Universal Morphology

    The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present a schema for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages - UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. With compatible tags, each project's annotations could be used to validate the other's. Additionally, the availability of both type- and token-level resources would be a boon to tasks such as parsing and homograph disambiguation. To ease this interoperability, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema. We validate our approach by lookup in the UniMorph corpora and find a macro-average of 64.13% recall. We also note incompatibilities due to paucity of data on either side. Finally, we present a critical evaluation of the foundations, strengths, and weaknesses of the two annotation projects.
    Comment: UDW1
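The deterministic mapping described above can be sketched as a feature-by-feature table lookup. The table below is a tiny hypothetical fragment for illustration; the paper's actual mapping covers the full UD v2 feature inventory.

```python
# Minimal sketch of a deterministic UD-feature -> UniMorph conversion.
# The lookup table is a small invented fragment, not the paper's mapping.
UD_TO_UNIMORPH = {
    ("Number", "Sing"): "SG",
    ("Number", "Plur"): "PL",
    ("Tense", "Past"): "PST",
    ("Tense", "Pres"): "PRS",
    ("Person", "3"): "3",
}

def map_ud_features(ud_feats: str) -> str:
    """Convert a UD v2 feature string like 'Number=Plur|Tense=Past'
    into a semicolon-joined UniMorph tag bundle."""
    tags = []
    for pair in ud_feats.split("|"):
        feat, value = pair.split("=")
        tags.append(UD_TO_UNIMORPH[(feat, value)])
    # UniMorph bundles are unordered sets; sort for a canonical form.
    return ";".join(sorted(tags))

print(map_ud_features("Number=Plur|Tense=Past"))  # -> PL;PST
```

Because the mapping is a pure lookup, its recall is bounded by table coverage, which is consistent with the incompatibilities and data-paucity issues the abstract reports.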

    Quantifying the value of pronunciation lexicons for keyword search in low resource languages

    Abstract: This paper quantifies the value of pronunciation lexicons in large-vocabulary continuous speech recognition (LVCSR) systems that support keyword search (KWS) in low-resource languages. State-of-the-art LVCSR and KWS systems are developed for conversational telephone speech in Tagalog, and the baseline lexicon is augmented via three different grapheme-to-phoneme models that yield increasing coverage of a large Tagalog word list. It is demonstrated that while the increased lexical coverage - or reduced out-of-vocabulary (OOV) rate - leads to only modest (ca. 1%-4%) improvements in word error rate, the concomitant improvements in actual term-weighted value are as much as 60%. It is also shown that incorporating the augmented lexicons into the LVCSR system before indexing the speech is superior to using them post facto, e.g., for approximate phonetic matching of OOV keywords in pre-indexed lattices. These results underscore the disproportionate importance of automatic lexicon augmentation for KWS in morphologically rich languages, and advocate for using augmented lexicons early, in the LVCSR stage.
    Index Terms: Speech Recognition, Keyword Search, Information Retrieval, Morphology, Speech Synthesis
    LOW-RESOURCE KEYWORD SEARCH. Thanks in part to the falling costs of storage and transmission, large volumes of speech, such as oral history archives [1, 2] and on-line lectures, are becoming available. We are interested in improving KWS performance in a low-resource setting, i.e. one where some resources are available to develop an LVCSR system - such as 10 hours of transcribed speech, corresponding to about 100K words of transcribed text, and a pronunciation lexicon that covers the words in the training data - but accuracy is sufficiently low that considerable improvement in KWS performance is necessary before the system is usable for searching a speech collection. A fair amount of past research has been devoted to improving the acoustic models from un-transcribed speech. The importance of pronunciation lexicons for LVCSR has not been entirely overlooked: several papers have addressed the problem of automatically generating pronunciations for out-of-vocabulary (OOV) words. Still, conventional wisdom holds that enlarging the lexicon yields only modest gains in overall transcription accuracy. Two notable exceptions to this conventional wisdom are (i) accuracy on infrequent, content-bearing words, which are more likely to be OOV, and (ii) accuracy in morphologically rich languages, e.g. Czech and Turkish. These exceptions come together in a detrimental fashion when developing KWS systems for a morphologically rich, low-resource language such as Tagalog. This is the setting in which we quantify the impact of increasing lexical coverage on the performance of a KWS system.
    We assume a transcribed corpus of 10 hours of Tagalog conversational telephone speech. We first develop state-of-the-art LVCSR and KWS systems based on the given resources. We process and index a 10-hour search collection using the KWS system, and measure KWS performance using a set of 355 Tagalog queries. We then explore three different methods for augmenting the 5.7K-word lexicon to include additional words seen in the larger LM training corpus. The augmented lexicons are used to improve the KWS system in two different ways: by reprocessing the speech with the larger lexicon, or by using it during keyword search. The efficacy of the augmented lexicons is measured in terms of actual term-weighted value.
    Acknowledgment: The authors, listed here in alphabetical order, were supported by DARPA BOLT contract No. HR0011-12-C-0015 and IARPA BABEL contract No. W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, IARPA, DoD/ARL, or the U.S. Government.
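The two quantities this entry turns on can be sketched in a few lines: the OOV rate that lexicon augmentation reduces, and term-weighted value (TWV), which NIST's KWS evaluations define as one minus the average of miss probability plus a cost factor times false-alarm probability per keyword (the standard cost factor is 999.9). The tiny lexicon and token list below are invented placeholders, not Tagalog data.

```python
import statistics

def oov_rate(lexicon, tokens):
    """Fraction of running tokens not covered by the pronunciation lexicon."""
    lex = set(lexicon)
    return sum(1 for t in tokens if t not in lex) / len(tokens)

def term_weighted_value(p_miss, p_fa, beta=999.9):
    """TWV = 1 - mean over keywords of (P_miss + beta * P_fa).
    beta = 999.9 is the standard NIST cost factor; 'actual' TWV (ATWV)
    evaluates this at the system's chosen detection threshold."""
    return 1 - statistics.mean(pm + beta * pf for pm, pf in zip(p_miss, p_fa))

# Placeholder data for illustration only.
lexicon = {"kumusta", "salamat", "araw"}
tokens = ["kumusta", "po", "salamat", "araw", "gabi", "salamat"]
print(f"OOV rate: {oov_rate(lexicon, tokens):.2%}")  # 2 of 6 -> 33.33%
```

The large beta explains the paper's headline asymmetry: an OOV keyword is missed every time it occurs, so even a small reduction in OOV rate can move ATWV far more than it moves word error rate.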